Comparing machine learning algorithms for multimorbidity prediction: An example from the Elsa-Brasil study

Background Multimorbidity is a worldwide concern related to greater disability, worse quality of life, and mortality. The early prediction is crucial for preventive strategies design and integrative medical practice. However, knowledge about how to predict multimorbidity is limited, possibly due to the complexity involved in predicting multiple chronic diseases. Methods In this study, we present the use of a machine learning approach to build cost-effective multimorbidity prediction models. Based on predictors easily obtainable in clinical practice (sociodemographic, clinical, family disease history and lifestyle), we build and compared the performance of seven multilabel classifiers (multivariate random forest, and classifier chain, binary relevance and binary dependence, with random forest and support vector machine as base classifiers), using a sample of 15105 participants from the Brazilian Longitudinal Study of Adult Health (ELSA-Brasil). We developed a web application for the building and use of prediction models. Results Classifier chain with random forest as base classifier performed better (accuracy = 0.34, subset accuracy = 0.15, and Hamming Loss = 0.16). For different feature sets, random forest based classifiers outperformed those based on support vector machine. BMI, blood pressure, sex, and age were the features most relevant to multimorbidity prediction. Conclusions Our results support the choice of random forest based classifiers for multimorbidity prediction.

Introduction Multimorbidity, usually defined as the presence of two or more chronic diseases, represents a huge challenge for health systems all over the world [1,2]. People with multiple chronic diseases commonly experienced loss of functional abilities, more frequent and longer hospitalization, and increased risk of premature death [1][2][3]. Traditional medical practices based on diagnosis and treatment of diseases in isolation highlight the limitations in health systems for the treatment of multimorbidity [4].
Although multimorbidity is a challenge worldwide, there are differences between low and middle-income countries (LMIC´s) and high-income countries (HIC´s) [3,5,6]. Multimorbidity affects more young people in LMIC´s than in HIC´s and represents a large part of the burden of disease in those countries [5][6][7]. The different profiles imply the need for different preventive approaches and continued treatment. In this sense, the knowledge of multimorbidity patterns and factors related, especially the modifiable factors, can help public policies for the prevention and treatment of multiple chronic diseases.
Studies on multimorbidity patterns in low and middle-income countries (LMIC´s) are still sparse [8]. Particularly, some studies have already described the most prevalent dyads and triads of chronic conditions and identified socio-economic factors as related to multimorbidity, but most studies deal with multimorbidity as a count of diseases [9,10]. Given the diversity of cooccurrences of chronic diseases, the number of diseases does not allow expressing the variety of possible multimorbidity patterns as well as related factors.
Recently, machine learning (ML) methods have been applied to many problems in public health to improve prediction and discover complex patterns of relationships not found by traditional methods [11]. Studies using unsupervised ML techniques have been published for the recognition of multimorbidity patterns, but there is still a gap of studies about modeling and prediction of multiple chronic diseases, possibly due to the complexity of establishing the classification of multiple diseases simultaneously [12][13][14].
The main challenges for multimorbidity prediction, not yet jointly addressed in the multiple disease prediction scenario, are: to incorporate the relationship among diseases in the prediction; to deal with the imbalanced and low occurrence of diseases; and to find a set of features that are low-cost, to be able to predict with reasonable accuracy the occurrence of multiple chronic conditions [12][13][14][15][16][17]. The contributions of this work are present a ML approach based on multilabel classifiers to deal with these problems, and apply it to the build and comparison of cost-effective multimorbidity prediction models, using a large sample from the Brazilian Longitudinal Study of Adult Health (ELSA/Brasil). We provided a comparison of the performance of seven multilabel classifiers and identified the features that are most relevant in the prediction.
To encourage the use of multilabel classifiers, we developed a web application available online at https://danielapaula.shinyapps.io/Multilabel_Tool/, where prediction models can be built and used practically and intuitively. The application can be used by the general public since it does not require prior knowledge of machine learning or programming.

Study population
ELSA-Brasil is a multicenter cohort study conducted in six Brazilian capitals. The main objective is to investigate the incidence and risk factors associated with cardiovascular diseases and diabetes. The baseline was performed from 2008 to 2010 and enrolled 15105 active or retired civil servants (35-74 years). Participants were submitted to a set of examinations, in addition to detailed personal interviews by trained personnel. The sampling procedures and study design of ELSA-Brasil have been reported previously [18,19].
ELSA-Brasil was approved by the National Research Ethics Committee (Conep-No. 976), and the research protocol developed at all Research Centers was approved by the Research Ethics Committee of the Oswaldo Cruz Institute (CEP Fiocruz/IOC-No. 343/06). All participants gave written consent to participate.

Study variables
Outcome variables. Chronic diseases with prevalence greater than or equal to 5% at baseline were included. Multimorbidity was defined as the presence of at least two chronic diseases in the set of 10 morbidities: cancer, diabetes, dyslipidemia, common mental disorder, migraine, heart disease (acute myocardial infarction, angina pectoris or heart failure), asthma, cirrhosis, joint problems, and kidney disease.
Cancer, heart disease, asthma, cirrhosis, joint problems, and kidney disease were identified by the participant's self-report of a previous diagnosis by a physician. Dyslipidemia was identified by clinical exam. Diabetes was defined by self-report or use of medication and when not reported, it was defined by clinical exam [20][21][22][23][24][25].
Common mental disorders were identified using the Clinical Interview Schedule-Revised (CIS-R) instrument (cut-off point for the presence of a disorder � 12 points) [26]. Migraine was defined according to a detailed headache questionnaire based on the International Headache Society (IHS) criteria [24].
Predictors variables. The choice of the set of predictors was based on previously published predictive models for chronic diseases [3,[26][27][28]. We considered 31 variables that do not imply additional costs beyond those easily accessible by physicians in a clinical visit. The variables were defined as follows: Sociodemographic variables: sex, age (in years), education (never attended to school to elementary, secondary, undergraduate and postgraduate), self-reported race/color (based on Brazilian Census: white, "pardo," black, indigenous and Asian), marital status (single or not single), per capita household income (in dolar), children (yes/no), maternal education (never attended to school, incomplete elementary, elementary, secondary and undergraduate).
Lifestyle and dietary variables: smoking (never, past, or current), consumption of alcohol (not consuming, moderate-consumption <210 and 140 g/wk for men and women, respectively, excessive-higher than previous consumption limits) [29], physical activity in leisure time (weak, moderate, and strong following the classification of the International Physical Activity Questionnaire, in the domain of leisure-time physical activity) [30]. Days of physical activity (number of days/week) [28]. Sleep symptom (yes/no). Sleeping problem (yes/no) [28]. Fruit consumption and vegetable consumption (high, daily, weekly and rarely) [31]. Coffee consumption (no; yes, with caffeine; yes, decaffeinated) [28].

Statistical analysis
Descriptive analysis. The analytical sample considered 14836 participants who provided information about the chronic diseases considered. Descriptive analysis was performed. Participants were grouped according to the number of chronic conditions (0,1,2,3,4, 5 or more), and differences among the groups were tested by Pearson's chi-square test, analysis of variance, and Kruskal-Wallis test. The multimorbidity patterns were identified and the prevalence analyzed [32]. The level of significance was set at p < .05. The R 4.1.0 software was used in all analyses.
ML algorithms. i. Incorporating the relationship between the diseases. Multilabel classifiers approach the issue of incorporating the relationship between the diseases in the model design. They provide different ways of establishing the relationship between the diseases and can be divided into two main categories: Algorithm adaptation and problem transformation. In this work, seven multilabel classifiers were implemented and their performances compared. One algorithm adaptation method (multivariate random forest) and three problem transformation methods (Binary Relevance, Dependent Binary Relevance, and Classifier Chain), with support vector machine (SVM) and random forest (RF) as base classifiers for each one of transformation methods [33,34]. These methods differ in the way they establish the relationship between the diseases for prediction: Multivariate random forest (MTV-RF)-The relationship between the labels is established by a composite normalized Gini index splitting rule, which uses a weighted covariance structure (e.g., auto regressive, compound symmetry) to assign the relationship between the labels.
Binary relevance (BR)-The simplest problem transformation method that implements a binary classifier for each label. The labels are predicted independently of each other and label dependencies are not taken into account.
Dependence Binary relevance (DBR)-The multilabel classification is transformed into simple binary classifications for each label, as well as in the BR method, but the dependence between the labels is established using for each label the actual information of all binary labels (except the target outcome) as additional features.
Classifier Chain (CC)-A binary classifier is trained for each label following a given order. The dependence between the labels is designed by including in the feature space of each classifier the true label information of all previous labels in the chain.
The choice of classifiers was based on scalability for large datasets and performance on prediction problems [17][18][19].
For model evaluation, we consider the performance measures usually considered for multilabel classifiers: Hamming loss, subset accuracy, accuracy, and F-measure, defined as follows [34]: Let D be a multilabel dataset, |D| the number of observed instances (for example, the number of participants in the study), L the full set of labels in D, and |L| the number of labels (for example, the number of diseases considered). For Y i the label set of i-th instance (observed diseases for the i-th participant), and Z i the subset of predicted labels (predicted diseases for i-th participant), we define: Hamming Loss: It is the most common evaluation metric in the multilabel literature, computed as the symmetric difference between predicted and true labels (predicted diseases that were not observed or observed diseases that were not predicted) divided by the total number of labels.
where Δ is the operator that returns the symmetric difference between Y i , the label set of the ith instance, and Z i , the predicted one. The |.| operator counts the number of 1's in this difference, in other words the number of miss predictions. Subset Accuracy: This metric is also known as 0/1 Subset Accuracy and Classification Accuracy, and it is the most strict evaluation metric. The[[]] denotes de Iverson bracket, which returns 1 if the expression inside it is true or 0 otherwise. In this case, its value is 1 only if the predicted set of labels equals the true one.
] returns 1 if all labels for i-th instance are equal to the predicted ones. Accuracy: It is defined as the proportion of correctly predicted labels concerning the total number of labels (predicted or observed) for each instance.
where |Y i \Z i | is the number of correctly predicted labels, and |Y i [Z i | is the total number of active labels, in the both real label set and the predicted one. F-Measure: This metric is the harmonic mean between Precision and Recall, providing a balanced assessment between precision and sensitivity. where , with |Y i |, the total number of truly relevant labels, and | Y i \Z i | defined as in Eq (3). (6), with |Z i |, the total number of predicted labels, and |Y i \Z i | defined as in Eq (3).
ii. Dealing with imbalance. To evaluate the imbalance of disease occurrences we used cardinality, density, and IRLbl (Imbalance ratio per Label). These measures characterize the dataset and can influence the classifiers, they are defined as follows [14,35]: Let D be a multilabel dataset, |D| the number of observed instances (for example, the number of participants in the study), L the full set of labels in D, and |L| the number of labels (for example, the number of diseases considered). For Y i the label set of i-th instance (observed diseases for the i-th participant) in D, we define: Label Cardinality: is the average number of labels of the observations in a dataset D: where |Y i | is the total number of truly relevant labels for the i-th instance.
Label Density: is the average number of labels of the observations in a dataset D divided by | L|: IRLbl (Imbalance ratio per Label): where the symbol [[]] denotes the Iverson bracket, which returns 1 if the expression inside it is true or 0 otherwise, y denotes the label for which the measure IRLbl(y) is calculated, y´denotes a label evaluated in L (y´2L), and max is the maximal value.
IRLbl is a measure calculated individually for each label and represents the maximum observed occurrence in the set of labels divided by the occurrence of each label (the highest observed occurrence of a disease divided by the occurrence of each disease). The higher is the IRLbl the larger would be the imbalance, allowing to know which labels are in minority or majority. MeanIR is the average IRLbl for an MLD. It is useful to estimate the global imbalance level [14,35].
Due to the imbalance observed in the dataset regarding the chronic conditions, the performance measures were averaged over stratified 10-fold cross-validation (CV), repeated 5 times. In each iteration, the algorithms were then trained in turn on nine partitions and evaluated on the remaining partition. We consider a nested 3-fold cross-validation for hyperparameter tuning. The methods can handle missing values internally, except for the SVM-based algorithms, for which the imputation was performed within the CV, on the test and training partitions separately, to avoid bias in performance estimation. Technical details are given in the appendix (SMethods1 in S1 Appendix).
A possible strategy to deal with imbalance is to consider a resampling algorithm in model design. To evaluate the impact of resampling algorithms, we applied the random oversample based on the IRLbl [35]. The oversample was applied on each partition of the training set generated during the stratified cross-validation procedure [18,36,37].
iii. Searching for low-cost predictors. To consider searching for low cost features, we used a feature selection technique based on information gain. Variable importance was estimated for the MTV-RF model. We analyzed the effect of the number of features in classification performance using the Binary Relevance + Information Gain (BR+IG) approach [16,38]. This feature selection consists of first transforming the multilabel data into single-label datasets, by Binary Relevance, and then using them to select features based on information gain scores. We analyze the model performances for different numbers of features following the score ranking. Variable importance was averaged over 100 runs for the MTV-RF using a generalization of the permutation variable importance [33].
iv. Web application. The web application for building and using multilabel classifiers, called Multilabel_Tool, is available at https://danielapaula.shinyapps.io/Multilabel_Tool/, and was developed using the Shiny package from R. A sample of the R code files for creating the application is available at https://github.com/paula-daniela/Multilabel_Tool.git, as well as two examples of datasets that can be used as input files for building the models and making predictions.
On the developed web application, MTV-RF, BR, and CC classifiers are available, as well as measures of information gain (BR+IG), so that it is possible to evaluate the performance of the models for different sets of features selected from (BR+IG), and to perform hyparameter tuning by cross-validation. A brief tutorial of the web application is available in the appendix (SResults3 in S1 Appendix).

Results
Among the 14836 participants included in this study, 8347 (56.3%) had multimorbidity ( Table 1). The average age was 52.1 years, with a range of 34-75 years. Most of the participants with multimorbidity were women 5060 (60.6%) and had a higher mean age (53.2). Among the subgroups with the highest number of diseases, we observe a gradient of increase in the average age, and proportion of women ( Table 1).
Regarding the imbalance between occurrences, the dataset has a label cardinality of 1.86 and a density of 0.19. The mean IRLbl of the diseases was 3.82, indicating an average imbalance of approximately four between the occurrences of dyslipidemia and the other diseases. To evaluate resampling methods to deal with imbalance we applied random oversampling based on IRLbl for the MTV-RF classifier. Oversampling was applied to each partition of the training set generated during the stratified cross-validation procedure. We observed a slight  improvement in all performance measures, especially in subset accuracy, but, due to the computational cost, we did not apply the resampling method to the other classifiers. The results are in the appendix (SResults1 in S1 Appendix). The performance measures for the ML algorithms, considering all predictors, are presented in Table 2. In general, the methods based on RF, as adaptative as well as transformation, showed better performance than the methods based on SVM. The best performance was achieved for RF-CC.
For each feature, the info gain was evaluated using the BR+IG approach. BMI had the highest information gain, followed by systolic BP, waist-hip ratio, diastolic BP, and age. Sets of 5, 10, 15, 20, and 25 features were selected by BR+IG rank, to compare the performance of the methods on smaller sets of predictors. For most methods, the performance measures F-measure and accuracy become stable after 10 features, and for Hamming loss, from 15 features on. Regarding subset accuracy, SVM-based classifiers were the most stable, regardless of the number of features (Fig 2).
SVM-based transformation methods perform better concerning Hamming loss and subset accuracy for a few features (Fig 2). As the number of features increases, RF-based methods take the lead, besides performing better regarding accuracy and F-measure, even for small numbers of features. RF-BR is one of the best methods concerning accuracy and F-measure for all feature sets. RF-BR is among the best methods regarding Hamming loss from 15 features, with performance similar to RF-CC and RF-DBR, and concerning subset accuracy, from  22 features. The difference is that the last two methods, which include dependency between labels, have higher subset accuracy than RF-BR, with RF-CC outperforming RF-DBR from 25 features on.
Analogously to the BR+IG rank, the variable importance for the MTV-RF showed that BMI was the best predictor for multimorbidity, followed by systolic BP, age, diastolic BP, and sex (Fig 3). Furthermore, the sets of the first top eight predictors were equal by both importance and BR+IG ranks. The importance decreases for the other features with values close to zero.  To exemplify the classification rule and interpretability of the MTV-RF classifier, a subsample of the study population was considered (n = 1340), for simplicity. The resulting classification rule and the trees generated for each disease considered are available in the appendix (SResults2 in S1 Appendix).

Discussion
Recent studies on multimorbidity have implemented ML techniques to identify patterns of association between chronic conditions [11][12][13]. However, knowledge about modeling and prediction is still limited. Few studies deal with predictions for multiple diseases, mostly used counts [9,10,39]. When predicting with counting, the full complexity of relationships between diseases and between diseases and predictors is neglected in exchange for a less complex modeling process. The difficulties in predicting multimorbidity and how to overcome these issues when building a classifier have not yet been reported. The present study is the first to present the main problems in predicting multiple diseases and how to solve them, besides providing a user-friendly web application for building and using prediction models.
We present a machine learning approach based on multilabel classifiers, and use it to build and compare the performance of seven classifiers for predicting multimorbidity. This methodology and developed web application can be used to build feasible models, for screening individuals with multimorbidity from a small set of characteristics easily accessible in clinical practice. In the context of LMIC´s, with scarce resources, where it is known that multimorbidity represents a high burden of disease, these models can help as a tool for prevention and targeting of interventions based on the modifiable factors [7,8]. Moreover, since the models can predict the disease's cooccurrence, they can support the design of integrated clinical practices of care instead of conventional fragmented strategies for each chronic disease, one of the main challenges for multimorbidity prevention and management [8].
The prevalence of multimorbidity observed in this study (56.3%) was higher than expected in Brazil (reported in previous studies between 16.8% and 24.2%) [10,[40][41][42][43]. Differences between prevalences can occur due to several reasons, such as the different sets of diseases and differences in variables such as age. The majority of participants with multimorbidity were women and had higher average age. Among participants with at least two chronic conditions, we observed a gradient of increase in the proportion of people with primary or secondary education, and decrease in average per capita income. Most studies found that age is a risk factor for multimorbidity as well as sex, however, there is still no consensus on other factors such as education and income [8].
Dyslipidemia, migraine and common mental disorder were the most prevalent diseases and were also among the most cooccurring. High cooccurrences were also found between diabetes and dyslipidemia, heart disease and dyslipidemia, migraine and common mental disorder, asthma and migraine, and joint problems and migraine. These results are in agreement with previous studies that reported a high prevalence of dyslipidemia, as well as associations between mental disorders, joint problems, migraine, and respiratory diseases [8,[41][42][43][44][45]. The patterns of association found are compatible with the patterns previously identified in Brazil and LMIC's named as "cardio-metabolic" and "musculoskeletal-mental" [41,44,45].
Considering the full set of features, the best performance was achieved for RF-CC. SVMbased transformation methods performed worse than both adaptive and transformation RFbased methods. Decision trees have already been indicated as a good method for multiple-disease classification over SVM-based methods, and our results support the choice of RF-based classifiers for multimorbidity prediction [14]. The analysis for the different feature sets confirmed that RF-based classifiers should be preferred over SVM-based classifiers from 15 predictors, selected by BR+IG. Despite the assumption of independence between diseases, RF-BR was consistently among the best predictors, with the advantage of easy implementation and scalability, However, RF-CC should be considered if subset accuracy is the most important metric, i.e. when we want to achieve the best performance regarding the accurate identification of all diseases a patient may have. The performance of the CC can decrease significantly for a dataset with a high number of labels and high dependency or cardinality, so more studies are needed for evaluation of this classifier for multimorbidity prediction [46,47].
BMI combined with age, sex, waist-hip ratio, systolic and diastolic BP, sleep symptom and sleeping problem composed the set of eight best predictors, for MTV-RF, according variable importance. They also had the highest information gains, which was expected given that they are measures based on decision tree algorithms. Such features, found to carry the highest importance, are related to chronic diseases and multimorbidity. Sex and age are commonly reported as risk factors for multimorbidity, BMI, BP, and waist-hip ratio are also pointed out as a risk factor, especially for cardio-metabolic multimorbidities, and sleep disturbances and multimorbidity have been related previously, in particular associated with neuropsychiatric and musculoskeletal conditions [8][9][10][48][49][50].
Our study has some limitations. Because this is a cross-sectional study, it is not possible to identify a well-established cause-effect relationship between the variables. In general, ML methods use features to predict a response, but they may not be causal factors, so the directionality between predictor and response is difficult to establish. It should be noted that the importance of the variables found is limited to the model and population studied, using other algorithms may change the relevance of the variables. Therefore, although ML provides some insights into risk factors, it is not a conclusive analysis for this purpose.
Despite these limitations, our study is the first to present a ML approach to solve the main problems in multimorbidity prediction in a scenario where the number of chronic diseases is the most used outcome in predictions. The ML methodology used and web application developed are comprehensive, and can be applied to any clinical field in which multiple outcomes are considered. Starting from a large sample size, which makes the ML process more robust, we show that it is possible, from a small set of features, that are easy to collect in clinical practice, to build a feasible tool for multimorbidity prediction.